1 Introduction

It is September of 2020, right in the middle of the COVID-19 pandemic and during the California statewide shutdown. You have just been hired by the California state public health agency to track mortality trends for the past four years, for which all data have been aggregated and de-identified (anonymized). You were given properly de-identified data from 2017 - 2020, which contain mortality records for the state of California by sex, leading cause of death, race / ethnicity, and location of death. All data were taken from the death certificates filed with the state.

You are provided 6 tables total, which have the following variables.

Table 1. Data dictionary of variables in the California mortality tables.
All Tables
Year Year of record
Month Month of record
County California county where death was recorded
Annual Mortality Table
annualCount Annual number of deaths recorded
raceEthnicity Race and/or Ethnicity of the deceased; Latinx is unfortunately treated as a racial category here rather than an ethnic group
Specialized Variables
leadingCauseDeath Leading cause of death per death certificate; See separate table for encoding
ageBins Age groups where the age at death was known
ageDeaths Number of deaths for the age bin specified
biologicalSex Biological sex of decedent
sexDeaths Number of deaths for the sex specified
placeDeathOccurred Location of death, per death certificate
placeDeaths Number of deaths matching specified criteria
raceDeaths Number of deaths matching race / ethnic criteria
totalDeaths Number of deaths per month, year, & county
Table 2. Underlying cause of death category based on coding using the Tenth Revision of the International Classification of Diseases (ICD) codes
3-letter Code Description ICD-10 Codes
HTD Diseases of the Heart I00-I09, I11, I13, I20-I51
CAN Malignant Neoplasms (Cancers) C00-C97
STK Cerebrovascular Disease (Stroke) I60-I69
CLD Chronic Lower Respiratory Disease (CLRD) J40-J47
INJ Unintentional Injuries V01-X59, Y85-Y86
PNF Pneumonia and Influenza J09-J18
DIA Diabetes Mellitus E10-E14
ALZ Alzheimer’s Disease G30
LIV Chronic Liver Disease and Cirrhosis K70, K73-K74
SUI Intentional Self Harm (Suicide) U03, X60-X84, Y87.0
HYP Essential Hypertension and Hypertensive Renal Disease I10, I12, I15
HOM Homicide U01-U02, X85-Y09, Y87.1
NEP Nephritis, Nephrotic Syndrome and Nephrosis N00-N07, N17-N19, N25-N27
OTH All Other Causes of Death All codes not listed above

3 The data for 2020 isn’t complete…

Ohhhh… that’s right. The pattern observed is simply a by-product of an incomplete record from 2020, isn’t it?! We only have eight months of observations. You realize at this point that you were so excited to jump into map-making that you missed something in how you were doing your calculations and/or planning your maps.
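One quick way to confirm the incomplete year is to count how many distinct months each year actually contains (a sketch, assuming the monthly totalDeaths tibble with Year and Month columns is already loaded):

```r
library(dplyr)

## Sketch: how many months of data does each year contain?
totalDeaths %>% 
  group_by(Year) %>% 
  summarize(monthsReported = n_distinct(Month))
```

If 2020 shows fewer months than the other years, any raw annual sums for 2020 will be artificially low.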

## Remove the extra characters from the name of the county
census <- census %>% 
  mutate(NAME = gsub(" County, California", "", NAME)) %>% 
  rename(County = NAME)

## Join with the total deaths
totalDeaths <- totalDeaths %>% 
  full_join(census, by = "County")

Let’s now look at annual and monthly trends in the total numbers of deaths.

Question 10 [1 point]:

Why did this happen, and how could we fix it? (More than one correct answer is possible.) Could we have made our maps this way and had them be accurate if we’d had a full 12 months of data for each year?

Your answer here.

3.1 Fixing the map’s accuracy.

Although there are several ways we could tackle this, for the map I am going to choose to take an average of monthly % mortality so that it no longer matters whether we have 8 months or 12 months of data available for any given year.

## Calculate the monthly percent mortality
mortalityMonthly_sf2 <- mortality_sf2 %>% 
  group_by(COUNTY_NAM, Year, Month) %>% 
  summarize(percMortality = 100*(sum(totalDeaths, na.rm = TRUE)/mean(totalN))) %>% 
  ## The census N repeats on every row within a group, so mean(totalN) collapses it to a single value
  ungroup()

## Find the five number summary
summ <- fivenum(mortalityMonthly_sf2$percMortality)
print("The five-number summary of % mortality is:")
## [1] "The five-number summary of % mortality is:"
summ
## [1] 0.00000000 0.04780236 0.05833171 0.07265292 0.16220308
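The code above computes the monthly values; the averaging step described earlier could then look like this (a sketch using the mortalityMonthly_sf2 tibble created above, with a hypothetical name for the result):

```r
## Sketch: average the monthly % mortality within each county-year,
## so 8 months of 2020 data are directly comparable to full years
mortalityAnnual <- mortalityMonthly_sf2 %>% 
  group_by(COUNTY_NAM, Year) %>% 
  summarize(avgMonthlyPercMortality = mean(percMortality, na.rm = TRUE)) %>% 
  ungroup()
```

Because each year contributes an average rather than a sum, a partial year no longer drags its value down mechanically.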

Question 11 [3 points]:

Now try executing a different way to more accurately portray the % mortality. (Hint: If you need an idea to get you started, ask yourself if you could filter the mortality_sf2 tibble in some way to restrict what is plotted.) What do you observe, and how do the results compare to the map I chose to make?

# Your code here

Your answer here.

4 Project future mortality

Your stakeholders have asked if you can quickly give them some idea of what to expect mortality to look like for the remaining months of 2020. Although this is not going to be the most robust method for time series projection available (a caveat!), you immediately think, “Oh, I could do a regression.” So, you do.

4.1 Regression predictions of mortality in 2020

mod <- lm(percMortality ~ COUNTY_NAM + Year + Month, data = mortalityMonthly_sf2)
# plot(mod)

## Make the new data:
## Get the names of the counties; repeat four times for the 4 months
COUNTY_NAM <- rep(unique(mortalityMonthly_sf2$COUNTY_NAM), 4)
## Make a set of months for each of the 58 counties for Sept, Oct, Nov, Dec
Month <- c(rep(9, 58),
            rep(10, 58),
            rep(11, 58),
            rep(12, 58))
Year <- rep(2020, length(Month))
newdata <- cbind.data.frame(COUNTY_NAM, Year, Month)

## Attach the predictions to the new data frame and flag these rows as predicted
newdata$propDeaths <- predict(mod, newdata = newdata)
newdata$isPredicted <- "1"

Let’s finish out the year of 2020 with predictions (shown in bubble points):
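One way to draw that graph with the predictions overlaid is sketched below; the object and column names follow the code above, and the geometry is dropped for a plain line graph:

```r
library(ggplot2)
library(dplyr)

## Observed monthly values, flagged as not predicted
observed <- mortalityMonthly_sf2 %>% 
  sf::st_drop_geometry() %>% 
  select(COUNTY_NAM, Year, Month, percMortality) %>% 
  mutate(isPredicted = "0")

## Predicted Sept-Dec values, renamed to match the observed column
predicted <- newdata %>% 
  rename(percMortality = propDeaths)

bind_rows(observed, predicted) %>% 
  filter(Year == 2020) %>% 
  ggplot(aes(x = Month, y = percMortality, group = COUNTY_NAM)) +
  geom_line(alpha = 0.3) +
  ## Show only the predicted rows as points (the "bubble points")
  geom_point(data = function(d) dplyr::filter(d, isPredicted == "1"),
             alpha = 0.5, size = 2) +
  labs(title = "Observed and predicted % mortality by county, 2020",
       y = "% Mortality") +
  theme_minimal()
```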

Question 12 [1 point]:

Do you think this regression prediction is robust? Why or why not? (Hint: Check your assumptions with plot(mod)!!)

# Your code here.

Your answer here.

4.2 Showing the predictions on a map
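One way to put the regression predictions on a map is sketched below; it assumes mortality_sf2 carries the county geometry in a geometry column and averages the four predicted months per county (predMap is a hypothetical name):

```r
library(ggplot2)
library(dplyr)

## Sketch: average predicted % mortality per county for Sept-Dec 2020,
## then join back to the county geometry and plot as a choropleth
predMap <- newdata %>% 
  group_by(COUNTY_NAM) %>% 
  summarize(predPercMortality = mean(propDeaths)) %>% 
  left_join(distinct(mortality_sf2, COUNTY_NAM, geometry), by = "COUNTY_NAM")

ggplot(predMap) +
  geom_sf(aes(fill = predPercMortality, geometry = geometry)) +
  scale_fill_viridis_c(name = "Predicted % Mortality") +
  labs(title = "Predicted % mortality by county, Sept-Dec 2020") +
  theme_minimal()
```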

5 Improving Predictions

By now, you have probably realized that the regression we did is neither adequate nor to be trusted. Your task is to improve the prediction. Lay out a brief analysis plan, defend your choice of machine learning algorithm, and attempt to update at least the map if not also the line graph. Be sure to include an RMSE or some other metric to show how your prediction is performing.

Wait, how? Ah, the pandemic is over in reality! You can find the actual number of deaths, per county, here.

Hint: You can read this file in as a dataframe using JSON API as follows!

deaths2020 <- fromJSON("https://data.chhs.ca.gov/datastore/odata3.0/579cc04a-52d6-4c4c-b2df-ad901c9049b7?$top=5&$format=json")

Remember to filter for the proper condition, Year, and Months to get the data you need for your testing set.
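Note that the `$top=5` in the example URL returns only the first five records. A sketch for pulling enough rows to cover the full year (same endpoint; the `$top` value here is an assumption you may need to adjust, and in OData JSON responses the records typically live in the value element):

```r
library(jsonlite)

## Sketch: raise $top to retrieve more than the first five records
resp <- fromJSON("https://data.chhs.ca.gov/datastore/odata3.0/579cc04a-52d6-4c4c-b2df-ad901c9049b7?$top=32000&$format=json")
deaths2020 <- resp$value
```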

Question 13 [15 points]:

Your analysis plan here.

# Your code here. You can add more code chunks as needed to do the work asked!

Your conclusions here.

If this open-ended analysis feels overwhelming, just ask for more hints to get you started!

6 Testing Hypotheses

California was one of the strictest states when it came to shutdown measures during the COVID-19 pandemic, with both statewide enforcement and a duration longer than those of many other states. California declared a state of emergency on March 4, 2020, with a mandatory statewide stay-at-home order issued on March 19, 2020. This statewide order didn’t end until January 25, 2021, and it wasn’t until April 6, 2021 that the state announced plans to fully reopen the economy by June 15, 2021.

Your stakeholder is not as interested in deaths from COVID-19 itself as in how the mandatory stay-at-home order seems to be affecting mortality due to certain conditions. For example, are people less likely to die from accident or injury while they’re under stay-at-home orders?

6.1 Death from Accident & Injury

Because of the mandatory COVID-19 shutdown in California, which has led to generally reduced movement throughout California, your stakeholder would like to know if the number of accidental deaths declined enough during COVID-19 shutdown to generate a signal.

HYPOTHESIS: I hypothesize that because of intense, mandatory stay-at-home measures, deaths attributed to accidents and injuries (INJ) declined during the COVID-19 shutdown during the months of 2020 in California.

So, let’s test it! We will start, as with all good analysis plans, with some more Exploratory Data Analysis.

6.1.1 Sex-specific Effects on INJ Mortality

Total Deaths by Sex

The leading causes of death are located in the sexDeaths table because that was how the data were provided to us. They are separated by sex, so we can check whether there are sex differences we should worry about before proceeding. We also need to join some of our tables and calculate some additional metrics, including:

  • Proportion of deaths from accident or injury out of all deaths (no longer adjusting for population size)

  • Ratio of adult females to males in the population

  • Ratio of females to males in the population among those over 65 years of age

  • Ratio of all adults over the age of 65 (retirement age) to those under retirement age

We have opted to adjust per total deaths, rather than population size, so that we can better examine the fraction of deaths due to particular causes regardless of the underlying size of the population. However, we do need to account for the ratio of females to males in the population, and it may be useful, now or in future analyses, to also have the ratio of females to males among those over 65 as well as the ratio of retirees (65+ years old) to non-retirees.

## Join the sex deaths table with the census data; then filter for INJ
injDeaths <- sexDeaths %>% 
  full_join(census, by = "County") %>% 
  filter(leadingCauseDeath == "INJ") %>% 
  ## Now full join with the first 4 columns of the totalDeaths table to get total deaths also
  full_join(totalDeaths[ , 1:4], by = c("County", "Year", "Month")) %>%
  ## rename the sexDeaths column to something more appropriate
  rename(injuryDeaths = sexDeaths) %>% 
  ## Calculate the proportion of all deaths that are from injury RELATIVE TO ALL DEATHS (not population)
  mutate(deathsInjuriesProportion = injuryDeaths/totalDeaths,
         femaleMaleRatio = femalesN/malesN,   ## The ratio of females to males in the population
         femaleMaleRatio65 = `females65+`/`males65+`,  ## The ratio of 65+ females to males in the population
         retireeRatio = over65/(totalN - over65),  ## Ratio of retirees (65+) to non-retirees
         Date = as.Date(paste(Year, Month, "1", sep = "-"), format = "%Y-%m-%d"),   ## Convert to a Date format for plotting
         deathsInjuriesProportion = ifelse(is.nan(deathsInjuriesProportion), 0, deathsInjuriesProportion))  ## Division by 0 yields NaN; turn it back to 0

Question 14 [1 point]:

Is it fair to model the values without accounting for an effect of sex, do you think? Why or why not?

Your answer here.

Proportion of Deaths by Sex

What if we calculate the proportion of deaths for each sex instead? Wouldn’t that give us a more accurate portrait of any sex-level differences that might exist?

injDeaths %>% 
  group_by(Month, biologicalSex, Year) %>% 
  summarise(deathsInjuriesProportion = mean(deathsInjuriesProportion, na.rm = TRUE)) %>% 
ggplot(aes(x = Month, 
           y = deathsInjuriesProportion, 
           color = as.factor(Year), 
           linetype = as.factor(biologicalSex))) +
  geom_line(alpha = 0.7, 
            size = 1.1) +
  scale_x_continuous(n.breaks = 10) +
  scale_color_manual(values = c("#d45731", "cornflowerblue", "#925ebf", "#86a607")) +
  scale_linetype_manual(values=c("dotdash", "solid")) +
  labs(title = "Proportion all deaths due to accident or injury, by sex",
       y = "Proportion of Deaths Due to Accident or Injury", 
       subtitle = "Cumulative for all California counties, Jan 2017 - Aug 2020",
       color = "Year",
       linetype = "Sex of Decedent") +
  theme_minimal()
## `summarise()` has grouped output by 'Month', 'biologicalSex'. You can override
## using the `.groups` argument.

Question 15 [1 point]:

What do you notice about the trend? Is there anything compelling here potentially? Can you find any possible explanation for what happened in May (month 5) of 2020?

Your answer here.

6.1.3 Predictive Modeling

With the previous model, I did a subpar linear regression in the hopes that you would improve upon my predictive model. This time, since I am going to have you focus on the exploration side of things, I am going to attempt to make a slightly better model – but please be aware that this is NOT as robust as I would like it to be to return to stakeholders. This is more for demonstration purposes so that we can get to our true goal: mapping predictions!

I know from the earlier model diagnostic plots that the linear regression assumptions were not upheld. We have several options we could explore here to make a better model, including performing a Box-Cox transformation, considering a LASSO regression (especially since there could be high collinearity among some counties, particularly the more urban vs. rural ones), or even switching to a regression algorithm that is agnostic to the underlying assumptions of Ordinary Least Squares, like a Random Forest. Here, the latter is exactly what we will implement, and we will add another ensemble method, the Gradient Boosting Machine (GBM), for comparison.

Before we move forward, let’s also clarify our question a bit.

Research Question:

What effects have County, Year, Month, stayAtHome, biologicalSex, femaleMaleRatio, totalN, and retireeRatio had on the proportion of all deaths that occurred due to accident or injury?

In other words, does the proportion of accident / injury-based deaths differ by the county where the decedent lived, the month in which the death occurred, whether it is during the COVID-19 stay-at-home order or not, the biological sex of the decedent, the sex ratio of adult females to males in the county, the population size of the county, and the ratio of older adults over 65 years relative to non-retirees?

Wait, why am I throwing retireeRatio in there? Ahh, we can think of reasons why accidental deaths might change over the duration of one’s life. Perhaps the relative proportion is unchanged and only the cause of the accidental death changes (e.g., more stunt-based deaths when younger, more falls when older), OR the actual proportion might decline with age because older people are generally less active and move / drive around less.

6.1.3.1 Data cleaning

There really isn’t much we need to do here, but I will go ahead and strip the dataset down to just the columns we are going to focus on in the analysis as well as drop any missing values. I am not going to do any imputation of missing values this time because there are some counties for which we just have had no reporting. We don’t want to impute for a county that has no data at all.

injDeaths_cleaned <- injDeaths %>% 
  dplyr::select(County, Year, Month, stayAtHome, biologicalSex, femaleMaleRatio, totalN, retireeRatio, deathsInjuriesProportion) %>% 
  drop_na()

6.1.3.2 Data partitioning

Before we do any encoding or transformations, let’s do our test-train split as we’ve already discussed.

Remember that little function we annotated in the last project, the one called calcSplitRatio? We are going to load it in and use it again to help us find our optimum split ratio! I have put it into an R script, calcSplitRatio.R. Then, we need to pass it our dataframe, injDeaths_cleaned, as well as the number of parameters we expect to test. Here the number of parameters is the 58 counties in the dataset plus the 7 other features we plan to use (Year, Month, stayAtHome, biologicalSex, femaleMaleRatio, totalN, and retireeRatio).

source("calcSplitRatio.R")
train_prop <- calcSplitRatio(df = injDeaths_cleaned, p = (length(unique(injDeaths_cleaned$County)) + 7))  ## 58 counties + 7 more parameters
## [1] "The ideal split ratio is 0.88:0.12 training:testing"

Not too surprisingly, since this is a smaller dataset, it is recommending a larger training set than we’ve seen before.

ind <- createDataPartition(injDeaths_cleaned$deathsInjuriesProportion,          
                          p = train_prop,                           
                          list = FALSE)
 
train <- injDeaths_cleaned[ind, ]
test <- injDeaths_cleaned[-c(ind), ]

6.1.3.3 Make one-hot encoded variables

Next, we need to do our encoding. We are going to one-hot encode all of the categorical variables: County, stayAtHome, and biologicalSex. We will manually make the one-hot encodings for stayAtHome and biologicalSex since there are only two conditions in each of those columns; the dummyVars() function would try to turn each into two columns, which is both unnecessary AND increases our dimensionality! But we will leverage dummyVars() from the caret package for the County variable to make our life much easier.

Note: I have encoded males as 1 because it is the condition in which we expect a higher proportion of injury-related deaths. Thus, females have been made the reference condition. Additionally, I encoded the stayAtHome condition of None as 0, making it the reference condition.

encodeTrain <- function() {
  ## Make one-hot encoded variables manually
  train <- train %>% 
    dplyr::mutate(stayAtHome = ifelse(stayAtHome == "None", 0, 1),
                  biologicalSex = ifelse(biologicalSex == "Male", 1, 0))
  
  ## Now make the one-hot encoded County
  county <- dummyVars(" ~ County", data = train)
  trainReady <- predict(county, newdata = train) %>%   ## Do the encoding
    data.frame() %>%  ## Make a dataframe 
    cbind(train) %>%  ## Add back to the train set
    select(-County)   ## Drop the original County column
  
  return(trainReady)
}
trainReady <- encodeTrain()

encodeTest <- function() {
  ## Make one-hot encoded variables manually
  test <- test %>% 
    mutate(stayAtHome = ifelse(stayAtHome == "None", 0, 1),
           biologicalSex = ifelse(biologicalSex == "Male", 1, 0))
  
  ## Now make the one-hot encoded County
  county <- dummyVars(" ~ County", data = test)
  testReady <- predict(county, newdata = test) %>%   ## Do the encoding
    data.frame() %>%  ## Make a dataframe 
    cbind(test) %>%  ## Add back to the test set
    select(-County)   ## Drop the original County column
  
  return(testReady)
}

testReady <- encodeTest()

Now, there’s one little problem… that’s a BIG problem. Not all of the counties have columns in the testReady dataset. Why not? Well, if a category was missing from the County column of the test data set, then its column would not be made by dummyVars()! So, we will need to make a blank column full of zeros for any columns missing from testReady. There is a handy function, setdiff(), that compares two vectors (here, the names of the columns of the dataframes) and returns what is present in the first vector but missing from the second.

fixMissingColumns <- function() {
  ## Store the missing columns into a vector
  columns_missing <- setdiff(names(trainReady), names(testReady))

  ## Make the new columns and populate them with zeros
  testReady[, columns_missing] <- rep(0, nrow(testReady))
  print(paste0("Column comparison: Are they the same now in test and train now? ", 
               ncol(testReady) == ncol(trainReady)))
  
  return(testReady)
}

## Encode train again so trainReady is fresh before the comparison
trainReady <- encodeTrain()

testReady <- fixMissingColumns()
## [1] "Column comparison: Are they the same now in test and train now? TRUE"

We can see that the testReady dataset now has the same number of columns as the trainReady dataset. Yay! Big little problem solved.

6.1.3.4 Center & Scale

This time, we will take advantage of the preProcess() function in caret as an alternative method. Last time, we used scale(). Notice that preProcess() computes the centering and scaling factors from the training data and then applies those same factors to the test set.

centerScale <- preProcess(trainReady, method = c("center", "scale"))

trainReady <- predict(centerScale, trainReady)
testReady <- predict(centerScale, testReady)
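With the encoded and scaled sets ready, the Random Forest and GBM described above could be fit and compared like this (a sketch using caret with default tuning; the `rf` and `gbm` methods require the randomForest and gbm packages, and the fit/RMSE object names are placeholders):

```r
library(caret)

## Sketch: fit both ensemble models with 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

rfFit <- train(deathsInjuriesProportion ~ ., data = trainReady,
               method = "rf", trControl = ctrl)
gbmFit <- train(deathsInjuriesProportion ~ ., data = trainReady,
                method = "gbm", trControl = ctrl, verbose = FALSE)

## Compare out-of-sample RMSE on the held-out test set
rfRMSE <- RMSE(predict(rfFit, newdata = testReady), testReady$deathsInjuriesProportion)
gbmRMSE <- RMSE(predict(gbmFit, newdata = testReady), testReady$deathsInjuriesProportion)
rfRMSE
gbmRMSE
```

Because the outcome was centered and scaled along with the predictors, these RMSE values are on the standardized scale; back-transform with the training mean and standard deviation if you want them in % mortality units.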

6.2 Your Hypothesis! (Rewrite this header to reflect your hypothesis)

Originally, I wanted you to test a condition of your own. Although I would still love for you to use these data on your own to explore the dataset further, to abbreviate this exercise and not overwhelm you with work before your final project, we will instead make only ONE map after we plan our question, hypothesis, and analysis.

Question 26 [2 points]:

Formulate a question based on these data that you would like to test relative to the effect of the stay-at-home order in California. Can you find any articles or reports to support this? (One link or citation required).

Your answer here.

Question 27 [1 point]:

Formulate a hypothesis based on your question.

Your answer here.

Question 28 [5 points]:

Formulate an analysis plan, including the maps you would make and the predictive algorithm you would use to test your question. Please give at least a few very descriptive sentences for full credit.

Your answer here.

Question 29 [5 points]:

Attempt to make a single map, as I did for the injury-related deaths relative to the stay-at-home order. Test your question about the shutdown visually, with a map.

# Your code here. Add more chunks if needed / desired.

Question 30 [1 point]:

What trends, if any, do you notice? Why do you come to this conclusion?

Your answer here.